Support Vector Machine - Capstone Project

Author

Gopi Shankar Reddy Mallu, Kavya Reddy Maale, Satya Nageswara Dinesh Donkada, Vamsi Krishna Kalla

Published

March 14, 2024

Introduction

Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used for classification or regression tasks. The main idea behind SVMs is to find a hyperplane that maximally separates the different classes in the training data. This is done by finding the hyperplane that has the largest margin, which is defined as the distance between the hyperplane and the closest data points from each class. Once the hyperplane is determined, new data can be classified by determining on which side of the hyperplane it falls. SVMs are particularly useful when the data has many features, and/or when there is a clear margin of separation in the data.

Fig: Linearly & Non-Linearly Separable Data

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Fig: Hyperplane in 2D & 3D feature space

Literature Review

Support Vector Machines (SVM) have emerged as a potent tool in the realm of supervised learning, offering a robust mathematical framework for both classification and regression tasks. With a foundation rooted in principles such as structural risk minimization and kernel functions (Jakkula, 2006), SVM has demonstrated exceptional generalization capabilities, adeptly handling non-linear decision boundaries through kernel tricks (Jun, 2021; Deris, 2011). Despite challenges like computational cost and scalability (Bhavsar, 2012), the evolution of SVM has led to significant contributions in diverse fields, including pattern recognition, computer vision (Kecman, 2005), and even agriculture, where it aids in optimizing crop yield and disease identification (Kumar et al., 2017). The ongoing advancements in SVM research are geared towards refining algorithms and broadening their application spectrum, especially in the context of burgeoning data volumes (Yue, 2003).

In the financial and healthcare sectors, SVM has proven its efficacy in various applications. It has been utilized to construct reliable stock market prediction models by analyzing financial indices like Earnings Per Share (EPS) and Net Profit Growth Rate (NPGR) (Han, 2007). In healthcare, SVM has been instrumental in developing advanced diagnostic tools, such as the optimized SVM model for early dementia prediction (Javeed et al., 2023) and the multi-disease prediction model using an improved SVM-radial bias kernel approach (Harimoorthy & Thangavelu, 2021). These innovations underscore the potential of machine learning in revolutionizing healthcare by facilitating early diagnosis and personalized treatment plans.

SVM’s application extends to domains like online retail and network security, where it addresses complex challenges with remarkable efficiency. In online marketplaces, SVM combined with Particle Swarm Optimization has enhanced the accuracy of text classification for customer reviews (Sahara et al., 2023), providing valuable insights for sellers. In the realm of network security, innovative approaches such as combining SVM with naïve Bayes feature embedding have been proposed for intrusion detection, achieving high accuracy rates in identifying network threats (Gu et al., 2021). Moreover, the development of hybrid methods for attack detection, which integrate SVM features with evolutionary algorithms and artificial neural networks, has shown significant promise in reducing dimensionality and training time while maintaining high detection accuracy (Hosseini & Zade, 2020).

Machine learning techniques, particularly SVM, are revolutionizing various fields by addressing complex challenges with precision and efficiency. In healthcare, SVM has been applied to electronic health records for cancer classification, achieving high accuracy rates in identifying different types of malignancies (Ghanem et al., 2021). Furthermore, SVM’s versatility is evident in its application across domains such as finance, where it has been used to assess credit risk for small and medium enterprises in supply chain finance (Zhang, Hu, & Zhang, 2015), and in cloud-based services, where it ensures data confidentiality and decision verifiability in health monitoring systems (Liang et al., 2021). These advancements highlight the transformative potential of machine learning techniques in enhancing diagnostic accuracy, optimizing financial assessments, and ensuring secure cloud-based services.

Dataset

Customer retention is a critical aspect for banks to ensure the sustainability of their operations. ABC Multinational Bank, in particular, places a strong emphasis on retaining its account holders. The primary objective of this analysis is to examine the customer data of the bank’s account holders to predict and prevent customer churn effectively.

The dataset under consideration contains information about account holders at ABC Multinational Bank, with the ultimate goal of predicting customer churn. The dataset comprises the following columns:

  • customer_id: A unique identifier for each customer, not used in the analysis.
  • credit_score: A numerical representation of the customer’s creditworthiness.
  • country: The country in which the customer resides.
  • gender: The gender of the customer (e.g., male, female).
  • age: The age of the customer in years.
  • tenure: The number of years the customer has been with the bank.
  • balance: The current balance in the customer’s account.
  • products_number: The number of products the customer has with the bank.
  • credit_card: Indicates whether the customer has a credit card with the bank.
  • active_member: Indicates whether the customer is an active member.
  • estimated_salary: The estimated annual salary of the customer.
  • churn: The target variable, indicating customer churn (1 for churned, 0 for not churned).

Source: Bank Churn Dataset

Methodology

Mathematical Intuition of Support Vector Machine

Consider a binary classification task where there are two classes, denoted by the labels +1 and -1. The input feature vectors (X) and the matching class labels (Y) comprise our training dataset.

The equation of the hyperplane can be written as:

\(w^Tx+b=0\)

The vector \(w\) is the normal vector to the hyperplane, i.e., the direction perpendicular to the hyperplane. The parameter \(b\) represents the offset, or the distance of the hyperplane from the origin along the normal vector \(w\).

The signed distance of a training point \(x_i\) from the hyperplane is:

\(d_i = \frac{w^Tx_i + b}{\|w\|}\)

where \(\|w\|\) denotes the Euclidean norm of the normal vector \(w\).

A new point \(x\) is then classified according to which side of the hyperplane it falls on:

\(\hat{y} =\begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}\)

Kernel Function in SVM

In Support Vector Machines (SVM), the kernel function plays a crucial role in transforming the input feature space into a higher-dimensional space where the data can be linearly separated. This is particularly useful in cases where the data is not linearly separable in its original space. The kernel function computes the dot product between the feature vectors in this higher-dimensional space without explicitly mapping the vectors into that space, which is known as the “kernel trick.”

Common types of kernel functions include:

  • Linear Kernel: \(K(x_i, x_j) = x_i^T x_j\). This is the simplest kernel, used when the data is linearly separable in its original space.

  • Polynomial Kernel: \(K(x_i, x_j) = (1 + x_i^T x_j)^d\). This kernel maps the input features into a polynomial feature space, allowing for polynomial decision boundaries of degree \(d\).

  • Radial Basis Function (RBF) Kernel: \(K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)\). Also known as the Gaussian kernel, it implicitly maps the features into an infinite-dimensional space, providing great flexibility for non-linear decision boundaries.

Each kernel function has its own set of parameters that need to be tuned for optimal performance. The choice of kernel function and its parameters can significantly impact the SVM model’s ability to capture the underlying patterns in the data.

Margin and Support Vectors

The margin in SVM is defined as the distance between the separating hyperplane and the nearest data points from each class, known as the support vectors. The goal of SVM is to find the hyperplane that maximizes this margin, as a larger margin is associated with better generalization ability of the model.

Support vectors are the data points that lie closest to the decision boundary and are critical in defining the position and orientation of the hyperplane. These are the points that directly influence the shape of the decision boundary, as any small change in their position can alter the hyperplane. The SVM model is said to be “sparse” because only the support vectors contribute to defining the hyperplane, while other data points have no influence.

Objective Function and Optimization

The objective function that SVM optimizes is a combination of maximizing the margin and minimizing the classification error. This is achieved through the minimization of the following objective function:

\(\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i\)

Subject to the constraints:

\(y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i\)

where \(w\) is the weight vector, \(b\) is the bias term, \(C\) is the regularization parameter, \(\xi_i\) are the slack variables representing the degree of misclassification of the \(i\)-th data point, and \(y_i\) are the class labels.

The hinge loss function is used in SVM to penalize misclassifications. It is defined as:

Hinge loss = \(\max(0, 1 - y_i (w^T x_i + b))\)

The hinge loss is zero for correctly classified points that are outside the margin, and it increases linearly for points that are on the wrong side of the hyperplane or within the margin.

The optimization of the objective function involves finding the values of \(w\) and \(b\) that minimize the function, subject to the constraints. This is typically done using quadratic programming techniques.

Data Preprocessing

In the initial stage of our analysis, we undertook several preprocessing steps to ensure the data was suitable for modeling. Since our dataset did not contain any null values, we focused on encoding categorical variables and scaling numerical features. The categorical variables were encoded, with ‘Gender’ mapped to a binary indicator and ‘Geography’ one-hot encoded, to convert them into a format that could be easily used by our machine learning algorithms. For numerical features like ‘Credit Score,’ ‘Age,’ ‘Tenure,’ ‘Balance,’ and ‘Estimated Salary,’ we applied standard scaling to normalize their distribution, ensuring that no single feature would dominate the model due to its scale.

Exploratory Data Analysis (EDA)

Our Exploratory Data Analysis (EDA) aimed to uncover patterns, detect anomalies, and test hypotheses about our data. We started with summary statistics to understand the central tendency, dispersion, and shape of the dataset’s distributions. For instance, we observed that the Credit Score ranged from 350 to 850, with a median of 659, and the Age of customers varied from 18 to 92 years, with a median age of 37 years.

We then proceeded to visualize the distribution of key variables using distribution plots. This helped us identify the skewness in the ‘Age’ distribution and the uniform distribution of ‘Estimated Salary.’ Pair plots were employed to explore the relationships between variables like ‘Age’ vs. ‘Estimated Salary’ and ‘Age’ vs. ‘Credit Score,’ providing insights into how different factors might influence customer churn.

Through our EDA, we also investigated the distribution of the target variable ‘churn’ across different geographical regions and examined how the number of products varied across different regions. Correlation plots were utilized to identify potential relationships between features, revealing a positive correlation between ‘Age’ and ‘Balance,’ and a negative correlation between ‘NumOfProducts’ and ‘Balance.’

Feature Engineering

In our feature engineering process, we transformed the ‘Gender’ column from categorical to numerical by encoding ‘Male’ as 1 and ‘Female’ as 0. We also applied one-hot encoding to the ‘Geography’ column to convert it into binary variables for each country, ensuring that our model could interpret these categorical features correctly. Additionally, we split our data into training and testing sets to evaluate the performance of our models on unseen data. To address class imbalance in our target variable, we employed the Synthetic Minority Over-sampling Technique (SMOTE), which helped create a more balanced distribution of classes. Finally, we scaled our data using the StandardScaler to ensure that all features contributed equally to the model’s performance, preventing any feature with larger values from dominating the model’s learning process.

Data Analysis

Loading Libraries

Code
library(tidyverse)    # includes dplyr and ggplot2
library(dplyr)
library(ggplot2)
#install.packages("corrplot")
library(corrplot)     # correlation plots
library(caret)        # modeling and resampling utilities
library(smotefamily)  # SMOTE oversampling
library(ROSE)         # resampling for imbalanced data

Load Data

Code
df <- read.csv("dataset/train.csv")
Code
head(df)
  id CustomerId        Surname CreditScore Geography Gender Age Tenure  Balance
1  0   15674932 Okwudilichukwu         668    France   Male  33      3      0.0
2  1   15749177  Okwudiliolisa         627    France   Male  33      1      0.0
3  2   15694510          Hsueh         678    France   Male  40     10      0.0
4  3   15741417            Kao         581    France   Male  34      2 148882.5
5  4   15766172      Chiemenam         716     Spain   Male  33      5      0.0
6  5   15771669       Genovese         588   Germany   Male  36      4 131778.6
  NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
1             2         1              0       181449.97      0
2             2         1              1        49503.50      0
3             2         1              0       184866.69      0
4             1         1              1        84560.88      0
5             2         1              1        15068.83      0
6             1         1              0       136024.31      1

Summary Statistics

Code
summary(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
  CreditScore         Age            Tenure         Balance      
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0  
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0  
 Median :659.0   Median :37.00   Median : 5.00   Median :     0  
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478  
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940  
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898  
 NumOfProducts   EstimatedSalary    
 Min.   :1.000   Min.   :    11.58  
 1st Qu.:1.000   1st Qu.: 74637.57  
 Median :2.000   Median :117948.00  
 Mean   :1.554   Mean   :112574.82  
 3rd Qu.:2.000   3rd Qu.:155152.47  
 Max.   :4.000   Max.   :199992.48  

Credit Score:

  • The Credit Score ranges from a minimum of 350 to a maximum of 850.

  • The median Credit Score is 659, indicating that half of the customers have a score below 659 and half have a score above.

  • The mean Credit Score is approximately 656.5, suggesting that the average creditworthiness of customers is in the mid-range.

  • The 1st quartile (25th percentile) is 597, and the 3rd quartile (75th percentile) is 710, indicating that 50% of customers have a Credit Score between 597 and 710.

Age:

  • The Age of customers ranges from 18 to 92 years. The median age is 37 years, meaning half of the customers are younger than 37 and half are older.

  • The mean age is approximately 38.13 years, indicating that the average customer is in their late thirties.

  • The distribution of Age is slightly right-skewed, as the mean is slightly higher than the median.

Tenure:

  • Tenure, or the number of years customers have been with the bank, ranges from 0 to 10 years.

  • The median tenure is 5 years, indicating that half of the customers have been with the bank for less than 5 years and half for more.

  • The mean tenure is approximately 5.02 years, suggesting that the average customer has been with the bank for around 5 years.

Balance:

  • The account Balance ranges from a minimum of 0 to a maximum of 250,898.

  • The median balance is 0, indicating that at least half of the customers have no balance in their account.

  • The mean balance is approximately 55,478, suggesting that while many customers have low or zero balances, some have significant amounts in their accounts.

Number of Products:

  • The Number of Products customers have with the bank ranges from 1 to 4.

  • The median number of products is 2, meaning that half of the customers have 2 or fewer products with the bank.

  • The mean number of products is approximately 1.554, indicating that on average, customers have between 1 and 2 products with the bank.

Estimated Salary:

  • The Estimated Salary ranges from a minimum of 11.58 to a maximum of 199,992.48.

  • The median estimated salary is 117,948, suggesting that half of the customers have an estimated salary below this amount and half above.

  • The mean estimated salary is approximately 112,574.82, indicating that the average estimated salary of customers is around 112k.

Count Of Categorical value types

Code
sapply(df[,c('Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited')], function(x) length(unique(x)))
     Geography         Gender      HasCrCard IsActiveMember         Exited 
             3              2              2              2              2 

Checking null values

Code
colSums(is.na(df))
             id      CustomerId         Surname     CreditScore       Geography 
              0               0               0               0               0 
         Gender             Age          Tenure         Balance   NumOfProducts 
              0               0               0               0               0 
      HasCrCard  IsActiveMember EstimatedSalary          Exited 
              0               0               0               0 

There are no null values in the data.

Distribution of target variable

Code
table(df$Exited)

     0      1 
130113  34921 

We can see that the number of customers who have not exited (130,113) is far greater than the number who have exited (34,921). This class imbalance in the data needs to be addressed while building the model.

Distribution of target variable across Geography.

Code
table(df$Geography, df$Exited)
         
              0     1
  France  78643 15572
  Germany 21492 13114
  Spain   29978  6235

France:

  • A total of 94,215 customers are from France.
  • Out of these, 78,643 customers have not exited the bank (retained), while 15,572 customers have exited (churned).
  • The churn rate for France is approximately 16.53%.

Germany:

  • A total of 34,606 customers are from Germany.
  • Out of these, 21,492 customers have not exited the bank, while 13,114 customers have exited.
  • The churn rate for Germany is approximately 37.89%.

Spain:

  • A total of 36,213 customers are from Spain.
  • Out of these, 29,978 customers have not exited the bank, while 6,235 customers have exited.
  • The churn rate for Spain is approximately 17.21%.

Which Gender has highest Credit Score?

Code
aggregate(df$CreditScore, by = list(df$Gender), FUN = mean)
  Group.1        x
1  Female 656.2437
2    Male 656.6169

Observations:

  • The difference in average credit scores between male and female customers is minimal, indicating that gender does not significantly impact creditworthiness in this dataset.

  • Both genders have an average credit score in the mid-650s, which is considered a fair credit score range.

Distribution of Age.

Code
ggplot(df, aes(x = Age)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.

  • There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.

  • The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.

  • There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).

Distribution of Estimated Salary:

Code
ggplot(df, aes(x = EstimatedSalary)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The distribution is quite uniform across different salary ranges, with no distinct peaks that would indicate a concentration of individuals around a specific salary bracket.

  • There are frequent spikes throughout the distribution, which may suggest that the data contains many unique values with small frequencies. This could be indicative of precise salary estimations rather than rounded figures.

  • The salaries range from very low values close to 0 up to 200,000, indicating a diverse group from potentially different economic backgrounds or job roles.

  • There is no obvious concentration of data points around the lower, middle, or upper salary range, which is unusual for income data where one typically expects to see more of a bell-shaped distribution centered around a median salary range.

Comparing the distribution of account balances between customers who have exited and customers who have not exited.

Code
ggplot(df, aes(x = as.factor(Exited), y = Balance)) + geom_boxplot()

Observations:

  • Balance Distribution:

    • The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.

    • Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.

    • The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.

  • Outliers:

    • There are visible outliers for both groups, indicated by the points beyond the whiskers of the box plot. These outliers represent customers with balances significantly higher than the general population of the dataset.

How does the distribution of the number of products vary across different geographical regions?

Code
ggplot(df, aes(x = Geography, fill = as.factor(NumOfProducts))) + geom_bar(position = "dodge")

Observations:

  1. France:

    • France has the highest count of customers using one product, followed closely by those using two products. The number of customers using three and four products is significantly lower.

  2. Germany:

    • Germany shows a similar pattern to France, with one and two products being the most common among customers. However, the count for one product is notably lower than in France, whereas the count for two products is slightly higher.

  3. Spain:

    • Spain’s pattern mirrors that of France and Germany, with one product being the most common, followed by two products. Again, three and four products are used by a considerably smaller number of customers.

Scatter plot of Age vs. Estimated Salary, colored by churn status, to check which age groups and salary ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = EstimatedSalary, color = as.factor(Exited))) + geom_point()

Observations:

  1. There doesn’t appear to be a clear pattern or correlation between Age and Estimated Salary with customer churn, as the exited and non-exited customers are interspersed throughout the plot without any distinct clustering.

  2. Customers who have exited are spread across all ages and salary levels, but there seems to be a slightly higher concentration of churned customers in the 40 to 50 age range.

Scatter plot of Age vs. Credit Score, colored by churn status, to check which age groups and credit score ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

  1. There is a wide distribution of Credit Scores across different ages with no clear pattern indicating that Credit Score by itself may not be a strong predictor of customer exit.

  2. Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.

Scatter plot of Estimated Salary vs. Credit Score, colored by churn status, to check which salary and credit score ranges have exited the bank.

Code
ggplot(df, aes(x = EstimatedSalary, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

The scatter plot shows no clear relationship between Credit Score and Estimated Salary in predicting customer churn: both customers who exited and those who did not are dispersed evenly across all ranges of salary and credit score.

Correlation Plot

Code
corr_matrix <- cor(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
corrplot(corr_matrix, method = "circle")

Observations:

There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.

Churn Rate by Geography

Code
churn_by_country <- df %>%
  group_by(Geography) %>%
  summarise(
    Total_Customers = n(),
    Churned_Customers = sum(Exited),
    Churn_Rate = (sum(Exited) / n()) * 100
  )
Code
print(churn_by_country)
# A tibble: 3 × 4
  Geography Total_Customers Churned_Customers Churn_Rate
  <chr>               <int>             <int>      <dbl>
1 France              94215             15572       16.5
2 Germany             34606             13114       37.9
3 Spain               36213              6235       17.2
Code
ggplot(churn_by_country, aes(x = Geography, y = Churn_Rate, fill = Geography)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Churn_Rate, 2)), vjust = -0.3) +
  labs(title = "Churn Rate by Country",
       x = "Geography",
       y = "Churn Rate (%)") +
  theme_minimal() +
theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5)) 

Code
df <- df %>% select(-id, -CustomerId, -Surname)
Code
names(df)
 [1] "CreditScore"     "Geography"       "Gender"          "Age"            
 [5] "Tenure"          "Balance"         "NumOfProducts"   "HasCrCard"      
 [9] "IsActiveMember"  "EstimatedSalary" "Exited"         
Code
table(df$Gender)

Female   Male 
 71884  93150 

Python

Code
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
Code
df=pd.read_csv('dataset/train.csv')
Code
df.head()
   id  CustomerId         Surname  ...  IsActiveMember EstimatedSalary Exited
0   0    15674932  Okwudilichukwu  ...             0.0       181449.97      0
1   1    15749177   Okwudiliolisa  ...             1.0        49503.50      0
2   2    15694510           Hsueh  ...             0.0       184866.69      0
3   3    15741417             Kao  ...             1.0        84560.88      0
4   4    15766172       Chiemenam  ...             1.0        15068.83      0

[5 rows x 14 columns]
Code

df.drop(['id','CustomerId','Surname'],axis=1,inplace=True)
Code
df['Gender'] = df['Gender'].replace({'Male': 1, 'Female': 0})
Code
df.head()
   CreditScore Geography  Gender  ...  IsActiveMember  EstimatedSalary  Exited
0          668    France       1  ...             0.0        181449.97       0
1          627    France       1  ...             1.0         49503.50       0
2          678    France       1  ...             0.0        184866.69       0
3          581    France       1  ...             1.0         84560.88       0
4          716     Spain       1  ...             1.0         15068.83       0

[5 rows x 11 columns]

One Hot Encoding

Code
df = pd.get_dummies(df, columns=['Geography'])

Train-Test Split

Code
X = df.drop('Exited', axis=1)
y = df['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Handling Data Imbalance

Code
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

Scaling

Code
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Code
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
SVC(kernel='linear')
Code
y_pred = svm_model.predict(X_test)
Code
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Accuracy: 0.8136979661085415
Code
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.88     39133
           1       0.55      0.62      0.58     10378

    accuracy                           0.81     49511
   macro avg       0.72      0.74      0.73     49511
weighted avg       0.82      0.81      0.82     49511

The macro-average F1 score is 0.73, which is not great, so we perform hyperparameter tuning to improve the model’s performance.

Hyperparameter Tuning using Grid Search CV

Code
# Grid search is commented out here due to its long runtime
#param_grid = {
#    'C': [0.1, 1, 10, 100],               # Regularization parameter
#    'kernel': ['linear', 'rbf', 'poly'],  # Kernel type
#    'gamma': ['scale', 'auto'],           # Kernel coefficient for 'rbf', 'poly', and 'sigmoid'
#}
Code
#svm = SVC()
#grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
Code
#grid_search.fit(X_train, y_train)
Code
#print('Best Parameters:', grid_search.best_params_)
#print('Best Score:', grid_search.best_score_)
Code
svm_model = SVC(kernel='poly', C=0.1, degree=3, gamma='scale')
svm_model.fit(X_train, y_train)
SVC(C=0.1, kernel='poly')
Code
#best_model = grid_search.best_estimator_
y_pred = svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Accuracy: 0.8427016218618085
Code
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90     39133
           1       0.61      0.68      0.65     10378

    accuracy                           0.84     49511
   macro avg       0.76      0.78      0.77     49511
weighted avg       0.85      0.84      0.85     49511

References

  1. Jakkula, V. (2006). Tutorial on support vector machine (SVM). School of EECS, Washington State University, 37(2.5), 3.

  2. Kecman, V. (2005). Support vector machines - an introduction. In Support vector machines: theory and applications (pp. 1-47). Berlin, Heidelberg: Springer Berlin Heidelberg.

  3. Yue, S., Li, P., & Hao, P. (2003). SVM classification: Its contents and challenges. Applied Mathematics-A Journal of Chinese Universities, 18, 332-342.

  4. Jun, Z. (2021). The development and application of support vector machine. In Journal of Physics: Conference Series (Vol. 1748, No. 5, p. 052006). IOP Publishing.

  5. Bhavsar, H., & Panchal, M. H. (2012). A review on support vector machine for data classification. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 1(10), 185-189.

  6. Deris, A. M., Zain, A. M., & Sallehuddin, R. (2011). Overview of support vector machine in modeling machining performances. Procedia Engineering, 24, 308-312.

  7. Han, Shuo. “Using SVM with Financial Statement Analysis for Prediction of Stocks.” Communications of the IIMA, vol. 7, 2007, scholarworks.lib.csusb.edu/cgi/viewcontent.cgi?article=1059&context=ciima.

  8. Ahmadi, Muhammad Iqbal, et al. “SENTIMENT ANALYSIS ONLINE SHOP on the PLAY STORE USING METHOD SUPPORT VECTOR MACHINE (SVM).” Seminar Nasional Informatika (SEMNASIF), vol. 1, no. 1, 15 Dec. 2020, pp. 196–203, jurnal.upnyk.ac.id/index.php/semnasif/article/view/4101. Accessed 13 Feb. 2024.

  9. Razzaghi, Talayeh, et al. “Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values.” PLOS ONE, vol. 11, no. 5, 19 May 2016, p. e0155119, https://doi.org/10.1371/journal.pone.0155119.

  10. Öz, Ersoy, and Hüseyin Kaya. “Support Vector Machines for Quality Control of DNA Sequencing.” Journal of Inequalities and Applications, vol. 2013, no. 1, 4 Mar. 2013, https://doi.org/10.1186/1029-242x-2013-85. Accessed 15 June 2021.

  11. “Support Vector Machine for Network Intrusion and Cyber-Attack Detection | IEEE Conference Publication | IEEE Xplore.” Ieeexplore.ieee.org, ieeexplore.ieee.org/abstract/document/8233268. Accessed 13 Feb 2024.

  12. Kumar, Sachin, et al. “Precision Sugarcane Monitoring Using SVM Classifier.” Procedia Computer Science, vol. 122, 2017, pp. 881–887, https://doi.org/10.1016/j.procs.2017.11.450. Accessed 25 July 2019.

  13. Javeed, A. et al. (2023) Early prediction of dementia using feature Extraction Battery (FEB) and optimized support vector machine (SVM) for Classification, MDPI. Available at: https://www.mdpi.com/2227-9059/11/2/439 (Accessed: 22 January 2024).

  14. Nawal, Y., Oussalah, M., Fergani, B., & Fleury, A. (2022). New incremental SVM algorithms for human activity recognition in smart homes. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-022-03798-w

  15. Zhang, L., Hu, H., & Zhang, D. (2015). A credit risk assessment model based on SVM for small and medium enterprises in supply chain finance. Financial Innovation, 1(1). https://doi.org/10.1186/s40854-015-0014-5

  16. Harimoorthy, K., Thangavelu, M. RETRACTED ARTICLE: Multi-disease prediction model using improved SVM-radial bias technique in healthcare monitoring system. J Ambient Intell Human Comput 12, 3715–3723 (2021). https://doi.org/10.1007/s12652-019-01652-0

  17. J. Liang, Z. Qin, L. Xue, X. Lin and X. Shen, “Verifiable and Secure SVM Classification for Cloud-Based Health Monitoring Services,” in IEEE Internet of Things Journal, vol. 8, no. 23, pp. 17029-17042, 1 Dec.1, 2021, doi: 10.1109/JIOT.2021.3075540.

  18. G. N. Ahmad, H. Fatima, S. Ullah, A. Salah Saidi and Imdadullah, “Efficient Medical Diagnosis of Human Heart Diseases Using Machine Learning Techniques With and Without GridSearchCV,” in IEEE Access, vol. 10, pp. 80151-80173, 2022, doi: 10.1109/ACCESS.2022.3165792.

  19. Sahara, S., Annida Purnamawati, Sulaeman Hadi Sukmana, Mely Mailasari, Erma Delima Sikumbang, & Puji, E. (2023). PSO optimization for analysis of online marketplace products on the SVM method. AIP Conference Proceedings. https://doi.org/10.1063/5.0129404

  20. “Prediction of Consumer Purchasing in a Grocery Store Using Machine Learning Techniques.” Ieeexplore.ieee.org, ieeexplore.ieee.org/document/7941935.

  21. Barakat, Nahla, et al. “Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus.” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 4, July 2010, pp. 1114–1120, https://doi.org/10.1109/titb.2009.2039485.

  22. “Applying Support Vector Machine to Electronic Health Records for Cancer Classification | IEEE Conference Publication | IEEE Xplore.” Ieeexplore.ieee.org, ieeexplore.ieee.org/abstract/document/8732906.

  23. “An Effective Intrusion Detection Approach Using SVM with Naïve Bayes Feature Embedding.” Computers & Security, vol. 103, 1 Apr. 2021, p. 102158, www.sciencedirect.com/science/article/pii/S0167404820304314, https://doi.org/10.1016/j.cose.2020.102158.

  24. Hosseini, Soodeh, and Behnam Mohammad Hasani Zade. “New Hybrid Method for Attack Detection Using Combination of Evolutionary Algorithms, SVM, and ANN.” Computer Networks, vol. 173, May 2020, p. 107168, https://doi.org/10.1016/j.comnet.2020.107168.